Preparation

Before we can start mapping in R, we need to load the required libraries. We do this once at the beginning for all libraries used throughout this tutorial. Packages that are not yet installed can be installed with the install.packages() function, for example install.packages("tidyverse").
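For instance, all packages used in this tutorial can be installed in a single call, since install.packages() accepts a character vector of package names (this only needs to be done once per machine):

```r
# Install every package used in this tutorial; run once, then load with library()
install.packages(c("HistData", "tidyverse", "lingtypology", "sf",
                   "rworldmap", "maps", "classInt"))
```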

library(HistData)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.6     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lingtypology)
library(sf)
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(rworldmap)
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(classInt)

We will demonstrate the core logic of ggplot2 with a famous non-linguistic dataset before proceeding to plotting map data. Francis Galton compared the heights of parents to the heights of their children. This dataset is built into the R package HistData. We can access it using the data() function as follows, and we can use the head() function to show the first six rows of the dataset.

data("Galton")
head(Galton)
##   parent child
## 1   70.5  61.7
## 2   68.5  61.7
## 3   65.5  61.7
## 4   64.5  61.7
## 5   64.0  61.7
## 6   67.5  62.2

First, the Galton dataset is 'piped' to the ggplot() function using %>%, the so-called 'pipe'. To draw points, we need their x- and y-values, so the aes() function maps the parent column to the x-axis and the child column to the y-axis. The + symbol adds layers to the plot, in our case geom_point(), which draws on the aesthetic mappings supplied in the initial ggplot() call.

Galton %>%
    ggplot(mapping = aes(x = parent, y = child)) + geom_point()
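For readers new to the pipe: %>% feeds its left-hand side in as the first argument of the function on its right, so a piped call is equivalent to a nested one. Base R's native |> operator (available since R 4.1) behaves the same way for simple calls; a minimal sketch with invented toy values:

```r
# The pipe passes its left-hand side as the first argument of the function
# on its right, so these two expressions compute the same result.
heights <- c(61.7, 62.2, 63.2, 70.5)  # toy values for illustration
piped  <- heights |> mean()
nested <- mean(heights)
identical(piped, nested)  # TRUE
```

In other words, Galton %>% ggplot(...) is equivalent to ggplot(Galton, ...).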

To make the plot easier to read, we can add random jitter to the x- and y-values; one way of achieving this is to swap geom_point() for geom_jitter(). We can additionally specify the argument alpha = 0.5, which controls the opacity of the points: a value of 0.5 makes each point 50% opaque, so dense regions appear darker. Finally, we add theme_classic(), a complete theme that reduces visual clutter such as the grey panel background ggplot2 uses by default.

Galton %>%
    ggplot(mapping = aes(x = parent, y = child)) + geom_jitter(alpha = 0.5) + theme_classic()

World Map

Without much effort we can create an interactive map with the lingtypology package. Its map.feature() function takes a vector of language names, created below with the 'concatenate' function c(). The resulting map shows one dot per language.

map.feature(c("Finnish", "Karelian", "Swedish", "Estonian", "Danish", "North Saami"))

To create our first self-designed map within the ggplot2 framework, we will start by creating an empty world map to which we can then add our linguistic data as a layer in a second step. To retrieve the data for our base layer, we will use the map_data() function. We store this information in an R object named world.

world <- map_data("world")
head(world)
##        long      lat group order region subregion
## 1 -69.89912 12.45200     1     1  Aruba      <NA>
## 2 -69.89571 12.42300     1     2  Aruba      <NA>
## 3 -69.94219 12.43853     1     3  Aruba      <NA>
## 4 -70.00415 12.50049     1     4  Aruba      <NA>
## 5 -70.06612 12.54697     1     5  Aruba      <NA>
## 6 -70.05088 12.59707     1     6  Aruba      <NA>

With this data, we can produce a first map of the world. We pass the world data we created to the ggplot() function and specify in the aesthetics that longitude and latitude serve as the x- and y-axes. The geom_map() function then instructs R to actually draw the world data.

world_plot <- ggplot(data = world, aes(x = long, y = lat)) + geom_map(map = world,
    aes(map_id = region))
world_plot

Now that we have our base map, we can add linguistic data to it. We will use data from the World Atlas of Language Structures (Dryer & Haspelmath, 2013) via the wals.feature() function, which imports features from the Atlas. Here, we import feature 81A, basic word order (the order of subject, object and verb), and store it as word_order.

word_order <- wals.feature(c("81a"))
## Don't forget to cite a source (modify in case of using individual chapters):
## 
## Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
## (Available online at https://wals.info/, Accessed on 2023-01-11.)
## 
## @book{wals,
##   address   = {Leipzig},
##   editor    = {Matthew S. Dryer and Martin Haspelmath},
##   publisher = {Max Planck Institute for Evolutionary Anthropology},
##   title     = {WALS Online},
##   url       = {https://wals.info/},
##   year      = {2013}
## }

In the next step we add this information in another layer to our map.

word_order_plot <- world_plot + geom_point(data = word_order, aes(x = longitude,
    y = latitude, color = `81a`))
word_order_plot

We will also style the map a little. We add a minimal theme and change some design aspects.

word_order_plot + theme_minimal() + theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5)) +
    scale_color_brewer(palette = "Set2") + ggtitle("Word Order in World Languages") +
    xlab("Longitude") + ylab("Latitude")

USA Swearing Map

In the next part of this tutorial, we will focus on visualizing a more specific region to show how to narrow down the area of the map. To start, we import the geographical county information for the USA and store it in an object we call us_geo. The st_as_sf() function changes the way the map data is stored: instead of having each point of longitude and latitude as its own row, all location details for one county are stored in a single row.

us_geo <- st_as_sf(maps::map(database = "county", plot = FALSE, fill = TRUE))
head(us_geo)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
## Geodetic CRS:  WGS 84
##                ID                           geom
## 1 alabama,autauga MULTIPOLYGON (((-86.50517 3...
## 2 alabama,baldwin MULTIPOLYGON (((-87.93757 3...
## 3 alabama,barbour MULTIPOLYGON (((-85.42801 3...
## 4    alabama,bibb MULTIPOLYGON (((-87.02083 3...
## 5  alabama,blount MULTIPOLYGON (((-86.9578 33...
## 6 alabama,bullock MULTIPOLYGON (((-85.66866 3...

We can now easily map the counties of the US.

ggplot(data = us_geo) + geom_sf()

The next step is importing data to map to the counties. We do this using the read_csv() function, which in this case reads the data from a URL via the url() function (read_csv() also accepts the URL string directly).

us_data <- read_csv(url("https://raw.githubusercontent.com/danaroemling/mapping-for-linguists/main/MAPPING_SWEARING.csv"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   county = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
us_data
## # A tibble: 3,076 x 53
##    county    ass asshole bastard  bitch bitched bitchy bloody bullshit  cock
##    <chr>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl> <dbl>
##  1 alaba… 1.52e6   49600    9538 9.62e5    6995   8903   5087   120184 15897
##  2 alaba… 1.25e6   54318    6578 8.07e5    2334   7851  14004    98452  7002
##  3 alaba… 2.26e6   29188    3243 9.60e5    3243   6486   3243    74591  3243
##  4 alaba… 1.45e6   14629    2926 1.01e6       0   8777      0   105328     0
##  5 alaba… 5.59e5   72969    4230 5.07e5    2115   5288   3173   101523  9518
##  6 alaba… 2.17e6   56605       0 1.18e6       0   8708      0   182878  8708
##  7 alaba… 2.64e6   38680   11282 1.81e6    6447   4835   3223   164390  9670
##  8 alaba… 1.60e6   38763    8012 9.18e5    2166   5197   4115    95500  8446
##  9 alaba… 1.88e6   34756    5902 1.44e6    1312   1967  20329   155419  7869
## 10 alaba… 3.80e5   37028    1683 2.73e5    1683   6732   5049    40394  1683
## # … with 3,066 more rows, and 43 more variables: crap <dbl>, crappy <dbl>,
## #   cunt <dbl>, damn <dbl>, damnit <dbl>, damned <dbl>, darn <dbl>, dick <dbl>,
## #   dickhead <dbl>, douche <dbl>, douchebag <dbl>, dumbass <dbl>, dyke <dbl>,
## #   fag <dbl>, faggot <dbl>, fatass <dbl>, freaking <dbl>, friggin <dbl>,
## #   fuck <dbl>, fucked <dbl>, fucker <dbl>, fuckery <dbl>, fucking <dbl>,
## #   goddamn <dbl>, gosh <dbl>, hell <dbl>, hoe <dbl>, homo <dbl>,
## #   jackass <dbl>, motherfucker <dbl>, motherfucking <dbl>, nigger <dbl>,
## #   piss <dbl>, pissed <dbl>, pissy <dbl>, pussy <dbl>, shit <dbl>,
## #   shittiest <dbl>, shitty <dbl>, slut <dbl>, slutty <dbl>, twat <dbl>,
## #   whore <dbl>

We now create a new object us_geo_data by joining the two objects, matching the ID column in us_geo to the county column in us_data, so that geolocation information and linguistic data are combined.

us_geo_data <- left_join(us_geo, us_data, by = c(ID = "county"))
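Under the hood, left_join() keeps every row of its first argument and attaches matching columns from the second, filling in NA where no match exists. A minimal base-R sketch of the same idea using merge(), with invented toy values:

```r
# Left join in base R: all.x = TRUE keeps every row of the left table;
# rows without a match in the right table get NA. Toy data for illustration.
geo  <- data.frame(ID = c("alabama,autauga", "alabama,baldwin", "alabama,barbour"))
vals <- data.frame(county = c("alabama,autauga", "alabama,baldwin"),
                   fuck   = c(100, 200))
joined <- merge(geo, vals, by.x = "ID", by.y = "county", all.x = TRUE)
joined
#                ID fuck
# 1 alabama,autauga  100
# 2 alabama,baldwin  200
# 3 alabama,barbour   NA
```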

Finally, we plot the new object us_geo_data. Just like above, we pass our data to the ggplot() function and add geom_sf() since we are handling polygons. Now, we also pass aesthetics to geom_sf() to say that the frequency of the word fuck should be mapped as the fill of the counties.

ggplot(data = us_geo_data) + geom_sf(aes(fill = fuck))

To make our map more informative, we introduce class intervals for our data. First, we create the intervals.

f_quantiles <- classIntervals(us_geo_data$fuck, n = 5, style = "quantile")
f_quantiles
## style: quantile
##        [0,889713)  [889713,1243106) [1243106,1557093) [1557093,1934757) 
##               615               615               615               615 
## [1934757,9527509] 
##               616

Then we assign each data point its corresponding interval.

us_geo_data <- mutate(us_geo_data, f_quantile = cut(fuck, f_quantiles$brks, include.lowest = TRUE,
    dig.lab = 10))
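The same binning can be sketched in base R: quantile() computes the break points and cut() assigns each value to its interval, mirroring classIntervals(..., style = "quantile") followed by cut(). The counts below are invented for illustration:

```r
# Quantile-based binning in base R; with n = 10 values and 5 bins,
# each bin receives two values.
counts <- c(5, 12, 37, 58, 101, 250, 412, 980, 1500, 9000)
brks   <- quantile(counts, probs = seq(0, 1, length.out = 6))  # 5 classes
bins   <- cut(counts, breaks = brks, include.lowest = TRUE, dig.lab = 10)
table(bins)
```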

In the next step we will use the class intervals we have introduced for mapping.

ggplot(data = us_geo_data) + geom_sf(aes(fill = f_quantile), lwd = 0.1, color = "grey") +
    scale_fill_brewer(palette = "Purples")

In the last step we style the map.

ggplot(data = us_geo_data) + geom_sf(aes(fill = f_quantile), lwd = 0.1, color = "grey") +
    scale_fill_brewer(palette = "Purples") + coord_sf(crs = "ESRI:102003") + ggtitle("'Fuck' Distribution in the US per County") +
    xlab("Longitude") + ylab("Latitude") + labs(fill = "Intervals") + theme_minimal()

German-speaking Area Map

We again use read_csv() to create a tibble containing cities, two features and their counts, the corresponding proportion of the features, and geolocation details. We can type gsa_data to check our new data.

gsa_data <- read_csv(url("https://raw.githubusercontent.com/danaroemling/mapping-for-linguists/main/MAPPING_DIALECT.csv"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   City = col_character(),
##   Token1 = col_character(),
##   Count1 = col_double(),
##   Token2 = col_character(),
##   Count2 = col_double(),
##   Proportion = col_double(),
##   Long = col_double(),
##   Lat = col_double()
## )
gsa_data
## # A tibble: 350 x 8
##    City       Token1 Count1 Token2 Count2 Proportion  Long   Lat
##    <chr>      <chr>   <dbl> <chr>   <dbl>      <dbl> <dbl> <dbl>
##  1 Aachen     schau     117 guck      139       45.7  6.08  50.8
##  2 Aalen      schau      12 guck        4       75   10.1   48.8
##  3 Abstatt    schau       1 guck        1       50    9.29  49.1
##  4 Ahlen      schau       3 guck        1       75    7.90  51.8
##  5 Albstadt   schau       1 guck        1       50    9.02  48.2
##  6 Alfter     schau       2 guck        1       66.7  7.01  50.7
##  7 Altenbeken schau       1 guck        1       50    8.95  51.8
##  8 Altenholz  schau       2 guck        3       40   10.1   54.4
##  9 Ammerbuch  schau       3 guck        1       75    8.97  48.6
## 10 Andernach  schau       2 guck        1       66.7  7.41  50.4
## # … with 340 more rows

In order to map this data, we first need a base map of the GSA. Similar to the world data available through ggplot2, rworldmap provides coordinates we can use to create this map. We store the map object returned by getMap() in world_map. We then define the GSA as Austria, Germany and Switzerland by concatenating them with c() into an object called GSA. The next code chunk extracts the coordinates for these countries and saves them in GSA_coord. If you want to adapt this code for your own maps, change the character vector assigned to GSA; for example, assigning c("Finland", "Iceland", "Estonia", "Denmark") would make the code chunk retrieve the coordinates for those countries instead. The code was adapted from this tutorial: https://egallic.fr/en/european-map-using-r/.

world_map <- getMap()
GSA <- c("Austria", "Germany", "Switzerland")
GSA_map <- which(world_map$NAME %in% GSA)
GSA_coord <- lapply(GSA_map, function(i) {
    df <- data.frame(world_map@polygons[[i]]@Polygons[[1]]@coords)
    df$region = as.character(world_map$NAME[i])
    colnames(df) <- list("long", "lat", "region")
    return(df)
})
GSA_coord <- do.call("rbind", GSA_coord)

After this bit we can use the coordinates to make the base layer of the GSA map.

ggplot(data = GSA_coord) + geom_polygon(aes(x = long, y = lat, group = region)) +
    coord_fixed(1.3)

In the last step we combine the base layer with the data in gsa_data. In the geom_polygon() function we first pass the aesthetics, saying that longitude and latitude should be mapped and that the polygons should be grouped by region, which here corresponds to the national borders of the three countries. The data is then added in the geom_point() layer, just as above: we pass the data argument and again aesthetics with longitude and latitude. We also map the proportion of our counts to the color argument, while the size argument reflects the combined count of the two features.

ggplot(data = GSA_coord) + geom_polygon(aes(x = long, y = lat, group = region), color = "black",
    size = 0.1, fill = "snow3") + coord_fixed(1.3) + geom_point(data = gsa_data,
    aes(x = Long, y = Lat, color = Proportion, size = (Count1 + Count2))) + scale_color_gradient(low = "seagreen3",
    high = "mediumpurple3") + guides(size = "none") + ggtitle("Schau vs Guck in the GSA") +
    xlab("Longitude") + ylab("Latitude") + theme_minimal()